Library Imports

from pyspark.sql import SparkSession
from pyspark.sql import types as T

from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal

Template

spark = (
    SparkSession.builder
    .master("local")  # run Spark locally inside this process
    .appName("Section 2 - Performing your First Transformations")
    .config("spark.some.config.option", "some-value")  # placeholder config entry
    .getOrCreate()
)

sc = spark.sparkContext  # handle to the lower-level RDD API
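
Since getOrCreate() returns any session that is already running, this cell is safe to re-run. The spark.some.config.option entry is just a placeholder; a quick sketch (assuming the builder above) to confirm the session is up and the setting took effect:

# The placeholder key set above is visible in the runtime config.
print(spark.conf.get("spark.some.config.option"))  # some-value
print(spark.version)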

import os

# Build the path to the sample data, one directory above the notebook.
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path

pets = spark.read.csv(path, header=True)
pets.toPandas()
   id  breed_id  nickname  birthday             age  color
0  1   1         King      2014-11-22 12:30:31  5    brown
1  2   3         Argus     2016-11-22 10:05:10  10   None
2  3   1         Chewie    2016-11-22 10:05:10  15   None
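
Note that header=True only supplies column names: every column above, including age, is read back as a string. A minimal sketch (assuming the same pets.csv layout) that instead supplies an explicit schema via the types module imported earlier, so no casting is needed downstream:

pets_schema = T.StructType([
    T.StructField("id", T.IntegerType()),
    T.StructField("breed_id", T.IntegerType()),
    T.StructField("nickname", T.StringType()),
    T.StructField("birthday", T.TimestampType()),
    T.StructField("age", T.IntegerType()),
    T.StructField("color", T.StringType()),
])

# Depending on your Spark version, you may also need to pass
# timestampFormat="yyyy-MM-dd HH:mm:ss" for the birthday column.
pets_typed = spark.read.csv(path, header=True, schema=pets_schema)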

Transformation

(
    pets
    .withColumn('birthday_date', F.col('birthday').cast('date'))
    .withColumn('owned_by', F.lit('me'))
    .withColumnRenamed('id', 'pet_id')
    .where(F.col('birthday_date') > datetime(2015, 1, 1))
).toPandas()
   pet_id  breed_id  nickname  birthday             age  color  birthday_date  owned_by
0  2       3         Argus     2016-11-22 10:05:10  10   None   2016-11-22     me
1  3       1         Chewie    2016-11-22 10:05:10  15   None   2016-11-22     me
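
Worth noting: every call in the chain above (withColumn, withColumnRenamed, where) is a lazy transformation that only builds up a query plan; it is the toPandas() action at the end that makes Spark read the file and do the work. A short sketch illustrating the split:

# Transformations are lazy: this line does not touch the data.
recent = pets.where(F.col('birthday').cast('date') > '2015-01-01')

recent.explain()  # inspect the query plan; still no data read
recent.show()     # an action: now the csv is read and filtered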

What Happened?

  • We renamed the primary key of our DataFrame from id to pet_id.
  • We cast the birthday column to a date, truncating away the time-of-day precision.
  • We filtered our dataset down to a smaller subset (see the sketch after this list for equivalent spellings of the filter).
  • We created a new column describing who owns these pets.
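
As shown below, the filter step has interchangeable spellings; .filter is an alias for .where, and both accept either a Column expression or a SQL string predicate (a sketch against the same pets DataFrame):

# Column-expression form, as used above ...
pets.where(F.col('birthday').cast('date') > datetime(2015, 1, 1))

# ... and the equivalent SQL-string forms.
pets.where("cast(birthday as date) > '2015-01-01'")
pets.filter("cast(birthday as date) > '2015-01-01'")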

Summary

We performed a variety of Spark transformations on our data; we will go through each of these transformations in detail in the following section.
